In May 2013, the Centers for Medicare and Medicaid Services released data on the price Medicare fee-for-service health services at hospitals around the US in year 2011. They wrote:
As part of the Obama administration’s work to make our health care system more affordable and accountable, data are being released that summarize the utilization and payments for procedures and services provided to Medicare fee-for service beneficiaries by specific inpatient and outpatient hospitals, physicians, and other suppliers. These data include information for the 100 most common inpatient services, 30 common outpatient services, and all physician and other supplier procedures and services performed on 11 or more Medicare beneficiaries. Providers determine what they will charge for items,services, and procedures provided to patients and these charges are the amount the providers bill for an item, service, or procedure. -source
This attracted press attention. For example, a headline in the New York Times read “Hospital Billing Varies Wildly, Government Data Shows”.
A map with the article shows the geographic disparities in hospital charges.
The news reports emphasized geographic disparities in the billing for procedures. I’m interested to know whether this might depend on other variables, for instance:
The data themselves are distributed by the Centers for Medicare and Medicade Services as a zip-compressed CSV file.
The download page describes the formats available.
Data are available in two formats:
- Tab delimited file format (requires importing into database or statistical software; SAS® read-in language is included in the download ZIP package)
- Microsoft Excel format (.xlsx), split by provider last name (note: organizational providers with name starting with a numeric are available in the “YZ” file).
CMS has also created two summary tables: 1) aggregated information by physician or other supplier and 2) aggregated information by State and HCPCS code. A detailed methodology document can be found in the Downloads section below which contains important information regarding the limitations of data usage.
Tab Delimited Format:
Medicare Physician and Other Supplier PUF, CY2012, Tab Delimited format [Note: This Compressed ZIP package contains the tab delimited data file (Medicare-Physician-and-Other-Supplier-PUF-CY2012.txt) which is 1.7GB uncompressed and contains more than 9 million records, thus importing this file into Microsoft Excel will result in an incomplete loading of data. Use of database or statistical software is required; a SAS® read-in statement is supplied. Additionally, this ZIP package contains the following supporting documents: CMS-AMA-CPT-2011-license-agreement.pdf, Medicare-Physician-and-Other-Supplier-PUF-SAS-Infile.sas, Medicare-Physician-and-Other-Supplier-PUF-Methodology.pdf]
Microsoft Excel Format:
Medicare Physician and Other Supplier PUF, CY2012, Microsoft Excel (.xlsx) Provider Last Name (A)
Medicare Physician and Other Supplier PUF, CY2012, Microsoft Excel (.xlsx) Provider Last Name (B)
Medicare Physician and Other Supplier PUF, CY2012, Microsoft Excel (.xlsx) Provider Last Name (CD)
Medicare Physician and Other Supplier PUF, CY2012, Microsoft Excel (.xlsx) Provider Last Name (EFG)
… and so on.
The compressed tab-delimited file is 395 MB. Uncompressed it’s 1700 MB. This seems large.
How do I even look at these data? Could a student manage this in the time frame of, say a course project?
I downloaded the tab-delimited file (taking about 45 minutes with a broadband connection), uncompressed it (taking about 0.5 % of my disk space), read it into R, and then saved it in a standard R format. It’s now part of the DCFdevel package.
fileName <- "/Users/kaplan/KaplanFiles/DCF-2014/CaseStudies/MedicareSpending/CMS-data-2013-05-06.rda"
load(fileName)
MedicareSpending <- med
Perhaps surprisingly, all that data, in the standard R format, is only NA MB. What accounts for the disparity? Do I have the right data?
The New York Times says:
The data for 3,300 hospitals, released by the federal Centers for Medicare and Medicaid Services, shows wide variations not only regionally but among hospitals in the same area or city.
A quick summary of the data using commands that students might encounter in an introductory statistics course:
nrow( MedicareSpending )
## [1] 163065
names( MedicareSpending )
## [1] "DRG.Definition"
## [2] "Provider.Id"
## [3] "Provider.Name"
## [4] "Provider.Street.Address"
## [5] "Provider.City"
## [6] "Provider.State"
## [7] "Provider.Zip.Code"
## [8] "Hospital.Referral.Region.Description"
## [9] "Total.Discharges"
## [10] "Average.Covered.Charges"
## [11] "Average.Total.Payments"
with( MedicareSpending, length(levels(Provider.Name) ) )
## [1] 3201
That corresponds well with the New York Times description.
So why are there 163065 rows if there are only 3201 hospitals? Because there is a separate listing for each of 100 procedures (known as a Direct Recovery Group (DRG)).
with( MedicareSpending, length(levels(DRG.Definition)))
## [1] 100
A typical provider is listed roughly 50 times. More precisely, each provider is listed this many times on average:
nrow( MedicareSpending ) /
with( MedicareSpending, length(unique(Provider.Name)))
## [1] 50.94189
That’s an ugly command. MedicareSpending is listed twice.
We’re going to be using a different system for data calculations, called dplyr.
MedicareSpending %>%
summarize( aveProc=n()/length(unique(Provider.Name)))
## aveProc
## 1 50.94189
As part of a sanity check of the data, look at how much Medicare paid out for these services. We expect that it will be several billion dollars.
MedicareSpending %>% summarize(
TotalPaid=sum(Total.Discharges*Average.Total.Payments,na.rm=TRUE),
TotalCharged=sum(Total.Discharges*Average.Covered.Charges,na.rm=TRUE)
) / 1e9
## TotalPaid TotalCharged
## 1 66.68952 245.9779
$246 billion was charged. “Only”" $67 billion was paid
MedicareSpending %>% group_by(Provider.State) %>%
summarize(
TotalPaid = mean( Average.Total.Payments, na.rm=TRUE ),
TotalCharged = mean( Average.Covered.Charges, na.rm=TRUE )
) -> StateCharges
tmp <- transform( StateCharges, State=reorder( Provider.State, TotalPaid ))
ggplot( melt(tmp,variable.name="Amount"), aes(x=State, y=value/1000, color=Amount)) +
geom_point( ) + scale_y_log10(breaks=c(1,5,10,15,20,30,40)) +
xlab("State") + ylab("1000 USD per procedure")
## Using Provider.State, State as id variables
Maybe better to show the variation within states:
tmp <- select(MedicareSpending, State=Provider.State,
Charged=Average.Covered.Charges,
Paid=Average.Total.Payments)
tmp %>% transform(State=reorder(State, Paid, FUN=median)) -> tmp
tmp2 <- melt(tmp,variable.name="Amount")
## Using State as id variables
ggplot( tmp2, aes(x = State, group = State, y = value/1000)) +
geom_boxplot( ) + scale_y_log10(breaks = c(1,5,10,15,20,30,40,80,160,320)) +
xlab("State") + ylab("1000 USD per procedure") + facet_grid( ~Amount )
Now by DRG:
# Transform the codes to just their numbers.
DRGnums <- substr(as.character(levels(MedicareSpending$DRG.Definition)),0,3)
MedicareSpending$DRG <- factor(MedicareSpending$DRG.Definition, labels=DRGnums)
MedicareSpending %>% select( State=Provider.State,
Charged=Average.Covered.Charges,
Paid=Average.Total.Payments,
DRG) -> tmp
tmp %>% mutate(DRG=reorder(DRG, Paid, FUN=median)) %>%
sample_n(size=100000) -> tmp
tmp2 <- melt(tmp,variable.name="Amount")
## Using State, DRG as id variables
ggplot( tmp2, aes(x = DRG, y = value/1000)) +
geom_boxplot( ) + scale_y_log10(breaks = c(1,5,10,15,20,30,40,60,80,100,150,200,300,500)) +
xlab("DRG") + ylab("1000 USD per procedure") + facet_grid( ~Amount )
nstates <- 3
tmp <- arrange(StateCharges,TotalPaid)
extremes <- rbind(head(tmp,nstates),tail(tmp,nstates))$Provider.State
extremes <- c(as.character(extremes),'NY','CA','TX','MN')
MedicareSpending %>%
select( State=Provider.State,
# Charged=Average.Covered.Charges,
Paid=Average.Total.Payments,
DRG) %>%
mutate(DRG=reorder(DRG, Paid, FUN=median)) %>%
filter( State %in% extremes) -> ExtremeStates
ggplot( ExtremeStates, aes(x = DRG, y = Paid/1000)) +
geom_boxplot( ) + scale_y_log10(breaks = c(1,5,10,15,20,30,40,60,80,100,150,200,300,500)) +
xlab("DRG") + ylab("1000 USD per procedure") + facet_wrap( ~ State, nrow=3 )
The DRGs vary in the same way within a state. Let’s pick one DRG to investigate. Perhaps the one that appears in the most states.
MedicareSpending %>% select( State=Provider.State,
Charged=Average.Covered.Charges,
Paid=Average.Total.Payments,
DRG, ZIP=Provider.Zip.Code) -> SmallMed
SmallMed %>% group_by(DRG) %>%
summarize( med = median(Paid), nStates=length(unique(State)),
nZips=length(unique(ZIP))) %>%
arrange(med) %>%
filter( rank(med) %in% c(45:55))
## Source: local data frame [11 x 4]
##
## DRG med nStates nZips
## 1 394 6600.800 51 1460
## 2 699 6625.232 49 879
## 3 491 6637.500 51 909
## 4 439 6711.474 49 918
## 5 176 6848.804 51 1344
## 6 287 7146.876 51 1623
## 7 872 7264.896 51 2364
## 8 065 7280.053 51 2141
## 9 281 7491.864 51 1495
## 10 640 7506.304 51 1644
## 11 190 7531.801 51 2568
Let’s take DRG 872: SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W/O MCC
SmallMed %>% filter( DRG=='872') -> DRG872
This service is provided in 2505 ZIP codes. Almost all of those ZIP codes have only one provider:
table(table(DRG872$ZIP))
##
## 1 2 3 4
## 2231 126 6 1
How much does the cost differ by state?
DRG872 %>% group_by(State) %>% mutate(med=median(Paid)) -> tmp
DRG872$State <- reorder(DRG872$State, tmp$med)
ggplot(DRG872, aes(x=State, y=Paid)) +
geom_boxplot(color='red',fill=NA, aes(y=Charged)) +
geom_boxplot(fill=NA) + scale_y_log10(breaks=c(1000,2000,4000,8000,15000,30000,60000,120000,240000))
Let’s pick some ZIP code variables:
data(ZipDemography)
data(ZipGeography)
ZD <- select(ZipDemography, ZIP, Pop=Totalpopulation,
Over65=X65yearsandover,
College=Bachelorsdegreeorhigher,
Disabled=Disabilitystatuspopulation21to64years,
Income=Medianhouseholdincomedollars)
ZG <- select(ZipGeography, ZIP, Latitude, Longitude, LandArea)
Zips <- inner_join(ZD, ZG) %>% mutate(ZIP = as.character(ZIP))
## Joining by: "ZIP"
DRG872 <- mutate(DRG872, ZIP=as.character(ZIP)) %>% inner_join( Zips)
## Joining by: "ZIP"
ggplot( DRG872, aes(x=Over65/Pop, y=Charged)) +
geom_point() + scale_y_log10()
## Warning: Removed 161 rows containing missing values (geom_point).
ggplot( DRG872, aes(x=Income, y=Charged)) +
geom_point() + scale_y_log10()
## Warning: Removed 160 rows containing missing values (geom_point).
ggplot( DRG872, aes(x=LandArea, y=Charged)) +
geom_point() + scale_y_log10() + scale_x_log10()
## Warning: Removed 160 rows containing missing values (geom_point).
ggplot( DRG872, aes(x=Pop/LandArea, y=Charged)) +
geom_point() + scale_y_log10() + scale_x_log10()
## Warning: Removed 160 rows containing missing values (geom_point).
DRG872 <- transform(DRG872, Rgroup=ntiles(Charged,4))
ggplot(DRG872, aes(x=Longitude,y=Latitude,color=Rgroup)) +
geom_point() + facet_wrap(~ Rgroup,nrow=2) + geom_polygon( data=all_states, aes(x=long, y=lat, group = group),colour="white", fill=NA )
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
Fit log(Charge) to DRG and take the residual.
mod <- lm(log10(Charged) ~ DRG, data = SmallMed)
SmallMed$Resid <- resid(mod)
SmallMed <- mutate(SmallMed, ZIP=as.character(ZIP)) %>% inner_join( Zips)
## Joining by: "ZIP"
SmallMed$State <- reorder(SmallMed$State, SmallMed$Resid)
ggplot( SmallMed, aes(x=State, y=Resid)) + geom_boxplot()
ggplot( SmallMed, aes(x=Pop/LandArea, y=Resid)) +
geom_point(alpha=.02) + scale_x_log10()
## Warning: Removed 12230 rows containing missing values (geom_point).
SmallMed <- transform(SmallMed, Rgroup=ntiles(Resid,9))
ggplot(SmallMed, aes(x=Longitude,y=Latitude,color=Rgroup)) +
geom_point() + facet_wrap(~ Rgroup,nrow=3) + geom_polygon( data=all_states, aes(x=long, y=lat, group = group),colour="white", fill=NA )
## Warning: Removed 12 rows containing missing values (geom_point).
## Warning: Removed 42 rows containing missing values (geom_point).
## Warning: Removed 57 rows containing missing values (geom_point).
## Warning: Removed 67 rows containing missing values (geom_point).
## Warning: Removed 74 rows containing missing values (geom_point).
## Warning: Removed 54 rows containing missing values (geom_point).
## Warning: Removed 43 rows containing missing values (geom_point).
## Warning: Removed 51 rows containing missing values (geom_point).
## Warning: Removed 31 rows containing missing values (geom_point).
mod2 <- lm(Resid ~
I(Pop/LandArea) + Income + I(College/Pop), data=SmallMed )
rsquared(mod2)
## [1] 0.04656275
anova(mod2)
## Analysis of Variance Table
##
## Response: Resid
## Df Sum Sq Mean Sq F value Pr(>F)
## I(Pop/LandArea) 1 139.7 139.746 3237.53 < 2.2e-16 ***
## Income 1 139.2 139.156 3223.86 < 2.2e-16 ***
## I(College/Pop) 1 9.5 9.503 220.16 < 2.2e-16 ***
## Residuals 136814 5905.5 0.043
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mod3 <- lm(Resid ~ I(Pop/LandArea) + Income + I(College/Pop) + State + Latitude, data=SmallMed )
rsquared(mod3)
## [1] 0.4294642
anova(mod3)
## Analysis of Variance Table
##
## Response: Resid
## Df Sum Sq Mean Sq F value Pr(>F)
## I(Pop/LandArea) 1 139.7 139.746 5408.58 < 2.2e-16 ***
## Income 1 139.2 139.156 5385.75 < 2.2e-16 ***
## I(College/Pop) 1 9.5 9.503 367.79 < 2.2e-16 ***
## State 43 2366.6 55.037 2130.10 < 2.2e-16 ***
## Latitude 1 5.1 5.057 195.70 < 2.2e-16 ***
## Residuals 136770 3533.8 0.026
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Can we capture the effect of State with demographic variables?